snapshots refactoring #1970

Merged
merged 37 commits into from
Apr 29, 2024

battlmonstr
Contributor

@battlmonstr battlmonstr commented Apr 16, 2024

Previously the Snapshot subclasses were responsible for everything:

  • snapshot serialization formats
  • related indexes management
  • data access layer

Most of the logic was copy-pasted 3 times with minor variations.

Now the Snapshot logic is reorganized to decouple the storage engine from the Ethereum domain objects.
The copy-paste is reduced.
The same design structure can be applied to support more Snapshot/Index types in the future.

The Repository data model is changed to contain a continuous sequence of full snapshot bundles (see (5) below).

[Image: 2024-04-19 17 18 21]

Changes:

  1. SnapshotWordDeserializer represents the snapshot word format. Multiple deserializers can be used for the same .seg (e.g. to get words as raw RLP).

  2. Snapshot represents a read-only view of a .seg file and provides iteration over its entries with a custom SnapshotWordDeserializer.

Example to open a snapshot:

Snapshot snapshot{path};
snapshot.reopen_segment();
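
A minimal sketch of the idea behind (1): the same stored words can be decoded by different deserializers. All names here (`Word`, `RawWordDeserializer`, `BigEndianWordDeserializer`, `decode_all`) are hypothetical and are not the actual classes from this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one entry ("word") of a .seg file.
using Word = std::vector<uint8_t>;

// Deserializer 1: keeps the word as raw bytes (e.g. undecoded RLP).
struct RawWordDeserializer {
    using Value = Word;
    Value value;
    void decode_word(const Word& word) { value = word; }
};

// Deserializer 2: interprets the same bytes as a big-endian integer.
struct BigEndianWordDeserializer {
    using Value = uint64_t;
    Value value{0};
    void decode_word(const Word& word) {
        value = 0;
        for (uint8_t byte : word) value = (value << 8) | byte;
    }
};

// Decodes every word with the chosen deserializer type.
template <class TDeserializer>
std::vector<typename TDeserializer::Value> decode_all(const std::vector<Word>& words) {
    std::vector<typename TDeserializer::Value> results;
    for (const Word& word : words) {
        TDeserializer deserializer;
        deserializer.decode_word(word);
        results.push_back(deserializer.value);
    }
    return results;
}
```

Calling `decode_all<RawWordDeserializer>(words)` and `decode_all<BigEndianWordDeserializer>(words)` on the same input gives two different typed views of the same data.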

  3. Index is decoupled from Snapshot. It is a wrapper of RecSplitIndex.

Example to open an index:

Index index{path.index_file()};
index.reopen_index();

  4. Data access layer is decoupled from Snapshot and consists of snapshot readers and query objects.

A snapshot reader is a type-safe iterator over Snapshot words. A reader is bound to a particular SnapshotWordDeserializer. Readers don't depend on indexes. They assume that the snapshot has already been opened for reading.

Example to iterate over headers:

for (const BlockHeader& header : HeaderSnapshotReader{snapshot}) { ... }
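
A rough sketch of how a reader can expose typed iteration by binding a deserializer to an iterator. `MiniSnapshot`, `MiniHeader`, `MiniHeaderDeserializer` and `MiniSnapshotReader` are hypothetical in-memory stand-ins, not the real types; the word format (one byte holding the number) is also an assumption made for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::vector<uint8_t>;

// Hypothetical in-memory stand-in for an opened .seg file.
struct MiniSnapshot {
    std::vector<Word> words;
};

// Hypothetical domain object; its word is a single byte holding the number.
struct MiniHeader {
    uint8_t number{0};
};

struct MiniHeaderDeserializer {
    using Value = MiniHeader;
    Value value;
    void decode_word(const Word& word) { value.number = word.empty() ? 0 : word[0]; }
};

// The reader is a range; its iterator decodes the current word on each step.
template <class TDeserializer>
class MiniSnapshotReader {
  public:
    class Iterator {
      public:
        Iterator(const std::vector<Word>* words, size_t pos) : words_(words), pos_(pos) { decode(); }
        const typename TDeserializer::Value& operator*() const { return deserializer_.value; }
        Iterator& operator++() {
            ++pos_;
            decode();
            return *this;
        }
        bool operator!=(const Iterator& other) const { return pos_ != other.pos_; }

      private:
        void decode() {
            if (pos_ < words_->size()) deserializer_.decode_word((*words_)[pos_]);
        }
        const std::vector<Word>* words_;
        size_t pos_;
        TDeserializer deserializer_;
    };

    explicit MiniSnapshotReader(const MiniSnapshot& snapshot) : snapshot_(snapshot) {}
    Iterator begin() const { return Iterator(&snapshot_.words, 0); }
    Iterator end() const { return Iterator(&snapshot_.words, snapshot_.words.size()); }

  private:
    const MiniSnapshot& snapshot_;
};

using MiniHeaderReader = MiniSnapshotReader<MiniHeaderDeserializer>;

// Sums header numbers to show the range-based for loop in use.
inline unsigned sum_header_numbers(const MiniSnapshot& snapshot) {
    unsigned total = 0;
    for (const MiniHeader& header : MiniHeaderReader{snapshot}) total += header.number;
    return total;
}
```

The caller only sees `MiniHeader` values; the word format stays hidden inside the deserializer, and no index is involved.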

Each query is a separate struct. Queries usually depend on indexes. In this case they assume that the index has already been opened for reading.

Common queries are implemented in basic_queries.hpp: FindByIdQuery, FindByHashQuery, RangeFromIdQuery. They are then bound to a particular reader to produce results:

struct HeaderFindByHashQuery : public FindByHashQuery<HeaderSnapshotReader> {};

Example to find a header by hash:

std::optional<BlockHeader> header_opt = HeaderFindByHashQuery{snapshot, index}.exec(hash);

Some queries have custom logic, for example: BodyTxsAmountQuery.
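
To illustrate how a generic query can be bound to a reader, here is a simplified sketch. It assumes an index that maps a string "hash" to a word position and a reader with a `read_at` method; every type and method name in it is hypothetical and much simpler than the real basic_queries.hpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins: a snapshot is a list of encoded words,
// an index maps a key (here a string "hash") to a word position.
struct MiniSnapshot {
    std::vector<std::string> words;
};
struct MiniIndex {
    std::map<std::string, size_t> positions;
};

struct MiniHeader {
    uint64_t number{0};
};

// Hypothetical reader: decodes the word at a position into a domain object.
// Words are assumed to be decimal strings for this sketch.
struct MiniHeaderReader {
    using Value = MiniHeader;
    const MiniSnapshot& snapshot;
    std::optional<Value> read_at(size_t position) const {
        if (position >= snapshot.words.size()) return std::nullopt;
        MiniHeader header;
        header.number = static_cast<uint64_t>(std::stoull(snapshot.words[position]));
        return header;
    }
};

// Generic query: index lookup first, then decoding through the bound reader.
template <class TReader>
struct MiniFindByHashQuery {
    TReader reader;
    const MiniIndex& index;

    MiniFindByHashQuery(const MiniSnapshot& snapshot, const MiniIndex& index_)
        : reader{snapshot}, index{index_} {}

    std::optional<typename TReader::Value> exec(const std::string& hash) const {
        auto it = index.positions.find(hash);
        if (it == index.positions.end()) return std::nullopt;
        return reader.read_at(it->second);
    }
};

// Binding the generic query to a reader yields a concrete, typed query.
struct MiniHeaderFindByHashQuery : public MiniFindByHashQuery<MiniHeaderReader> {
    using MiniFindByHashQuery<MiniHeaderReader>::MiniFindByHashQuery;
};
```

The lookup logic is written once in the template, and each concrete query only names the reader it is bound to.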

  5. Since the indexes are decoupled, they have to be stored separately in the Repository. To cope with the complexity, I've decided to simplify the Repository to a different model. Previously the repository had snapshots organized by type, then by path, and each snapshot contained its related indexes. Now it is a sequence of full bundles ordered by block_from, sequential and contiguous. A full bundle must have all snapshot and index types for a given block range. This means that the Repository no longer handles "partial bundles" and is stricter about not having gaps. SnapshotBundle is like a database schema:
struct SnapshotBundle {
    Snapshot header_snapshot;
    Index idx_header_hash;
    
    Snapshot body_snapshot;
    Index idx_body_number;
    
    Snapshot txn_snapshot;
    Index idx_txn_hash;
    Index idx_txn_hash_2_block;
};
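
The "no gaps" rule can be sketched as follows. This is only an illustration under assumptions: bundles are reduced to their block range, block_to is treated as exclusive, the sequence is assumed to start at block 0, and `MiniBundle`, `MiniRepository` and their methods are hypothetical names rather than the actual Repository API.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical reduced bundle: only the block range matters here.
struct MiniBundle {
    uint64_t block_from{0};
    uint64_t block_to{0};  // exclusive (assumption)
};

class MiniRepository {
  public:
    // First block number not covered by any bundle (0 when empty).
    uint64_t end_block() const {
        return bundles_.empty() ? 0 : bundles_.rbegin()->second.block_to;
    }

    // Accepts a bundle only if it extends the sequence without a gap or overlap.
    bool add_bundle(const MiniBundle& bundle) {
        if (bundle.block_from >= bundle.block_to) return false;
        if (bundle.block_from != end_block()) return false;
        bundles_.emplace(bundle.block_from, bundle);
        return true;
    }

    // Finds the bundle whose range contains the given block number.
    std::optional<MiniBundle> find_bundle(uint64_t block_number) const {
        auto it = bundles_.upper_bound(block_number);
        if (it == bundles_.begin()) return std::nullopt;
        --it;
        if (block_number >= it->second.block_to) return std::nullopt;
        return it->second;
    }

  private:
    // Keyed by block_from, so iteration order is the block order.
    std::map<uint64_t, MiniBundle> bundles_;
};
```

Because every accepted bundle starts where the previous one ends, a lookup by block number needs to inspect only one candidate bundle.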

@battlmonstr battlmonstr force-pushed the pr/snap_ref2 branch 12 times, most recently from e7edb0f to 01c1d2a Compare April 19, 2024 13:59
@battlmonstr battlmonstr marked this pull request as ready for review April 19, 2024 15:21
@battlmonstr battlmonstr added snapshots Framework for BitTorrent-based snapshots maintenance Some maintenance work (fix, refactor, rename, test...) labels Apr 19, 2024
@battlmonstr battlmonstr force-pushed the pr/snap_ref2 branch 3 times, most recently from 35c156e to 0b449ef Compare April 24, 2024 16:33
@battlmonstr
Contributor Author

Always setting read_senders=true has no tangible effect on performance.
This test on a 40 GB transactions file shows it (CMAKE_BUILD_TYPE=RelWithDebInfo):

TEST_CASE("tt") {
    Snapshot snapshot{*SnapshotPath::parse("/Volumes/WD4K/mainnet/snapshots/v1-017500-018000-transactions.seg")};
    snapshot.reopen_segment();
    TransactionSnapshotReader reader{snapshot};
    for ([[maybe_unused]] auto& t : reader) {}
}

with tx.set_sender() in decode_word_into_tx():

  1. run 1: 3:10
  2. run 2: 3:06
  3. run 3: 3:10

without tx.set_sender()/senders_data in decode_word_into_tx():

  1. run 1: 3:06
  2. run 2: 3:05
  3. run 3: 3:11
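
For reference, the averages of the runs above can be computed with a small sketch like this one, assuming the timings are in minutes:seconds. The helper names are hypothetical. The means come out at about 188.7 s with set_sender() and about 187.3 s without, a difference of about 1.3 s (roughly 0.7%), which is smaller than the 6 s spread between the "without" runs.

```cpp
#include <numeric>
#include <vector>

// Converts a "minutes:seconds" timing into seconds.
constexpr int to_seconds(int minutes, int seconds) { return minutes * 60 + seconds; }

// Arithmetic mean of run durations in seconds; returns 0 for no runs.
inline double mean_seconds(const std::vector<int>& runs) {
    if (runs.empty()) return 0.0;
    return std::accumulate(runs.begin(), runs.end(), 0.0) / static_cast<double>(runs.size());
}

// Timings copied from the two lists above.
inline std::vector<int> with_sender_runs() {
    return {to_seconds(3, 10), to_seconds(3, 6), to_seconds(3, 10)};
}
inline std::vector<int> without_sender_runs() {
    return {to_seconds(3, 6), to_seconds(3, 5), to_seconds(3, 11)};
}
```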

@battlmonstr
Contributor Author

battlmonstr commented Apr 28, 2024

@canepat I was able to execute pre-downloaded snapshots from scratch to the last block I have in snapshots (18.8M - 1) locally on macOS. It took about 4 days.

  INFO [04-28|14:42:08.162 UTC] [12/12 Finish]                           op=Forward from=0 to=18799999 span=18799999 
  INFO [04-28|14:42:08.163 UTC] ExecutionPipeline                        Forward done
  INFO [04-28|14:42:08.257 UTC] PoSSync: Waiting for blocks... from=18'799'999

@canepat
Member

canepat commented Apr 29, 2024

Great work!

@canepat canepat merged commit 20336c9 into master Apr 29, 2024
5 checks passed
@canepat canepat deleted the pr/snap_ref2 branch April 29, 2024 21:28
Labels
maintenance Some maintenance work (fix, refactor, rename, test...) snapshots Framework for BitTorrent-based snapshots